# Loading seed data from BigQuery

## Your Cloud Project ID
You'll need a Google Cloud project with BigQuery enabled (it is enabled by default) for this notebook and its associated code to work. If you don't already have an account, create one at [cloud.google.com](http://cloud.google.com); creating the account is free and you won't be auto-billed. Then copy your Project ID and paste it into `bq_project` below.

In [1]:
bq_project = 'patent-embeddings'
# Set up authentication with a service account instead of a user account
%env GOOGLE_APPLICATION_CREDENTIALS=C:\Users\tskripnikova\Documents\Patent embeddings-606d19a2d0cf.json
    
%load_ext autoreload
%autoreload 2

env: GOOGLE_APPLICATION_CREDENTIALS=C:\Users\tskripnikova\Documents\Patent embeddings-606d19a2d0cf.json
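Before running any queries, it can help to confirm that the credentials variable points at a key file that actually exists. The following is a minimal sketch and not part of the original notebook: `check_credentials_path` is a hypothetical helper name, and it only inspects the environment variable and the local file system without contacting Google Cloud.

```python
import os


def check_credentials_path(env=None):
    """Return the service-account key path if it is set and the file exists.

    Hypothetical helper: it checks only the environment mapping and the
    local file system; it does not validate the key with Google Cloud.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    if not path:
        raise ValueError("GOOGLE_APPLICATION_CREDENTIALS is not set.")
    if not os.path.isfile(path):
        raise FileNotFoundError("Credentials file not found: " + path)
    return path
```

Calling `check_credentials_path()` right after the `%env` line would surface a mistyped path early instead of during the first query.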


## Basic Configuration

In [2]:
import tensorflow as tf
import pandas as pd
import os

#seed_name = 'hair_dryer'
#seed_name = 'video_codec'
#seed_name = "contact_lens"
#seed_name = "contact_lens_us_c"
seed_name = "3d_printer"

seed_file = 'seeds/' + seed_name + '.seed.csv'

src_dir = "."

patent_dataset = 'patents-public-data:patents.publications_latest'
num_anti_seed_patents = 15000
if bq_project == '':
    raise Exception('You must enter a bq_project above for this code to run.')
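The `patent_dataset` value is written in the `project:dataset.table` form used by BigQuery legacy SQL, whereas standard SQL refers to the same table as `project.dataset.table` inside backticks. Which form the expander needs internally is not shown here, so the sketch below only illustrates how the identifier could be split and rewritten. `split_table_id` and `to_standard_sql` are hypothetical helper names, not part of the notebook or of any BigQuery client library.

```python
def split_table_id(table_id):
    """Split 'project:dataset.table' into a (project, dataset, table) tuple."""
    project, colon, rest = table_id.partition(":")
    dataset, dot, table = rest.partition(".")
    if not (colon and dot and project and dataset and table):
        raise ValueError("Expected 'project:dataset.table', got: " + repr(table_id))
    return project, dataset, table


def to_standard_sql(table_id):
    """Rewrite 'project:dataset.table' as a backtick-quoted standard SQL name."""
    return "`{}.{}.{}`".format(*split_table_id(table_id))


parts = split_table_id("patents-public-data:patents.publications_latest")
standard_name = to_standard_sql("patents-public-data:patents.publications_latest")
```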

## Patent Landscape Expansion

This section creates an instance of `PatentLandscapeExpander`, which queries a BigQuery table of patent data to expand the provided seed set. It returns each expansion level, as well as the final training dataset, as a Pandas dataframe.
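To make the idea of expansion levels concrete, here is a toy sketch over an in-memory reference graph. It is an assumption-laden illustration only: the actual rules used by `PatentLandscapeExpander` (which fields it follows and how it samples anti-seed patents) are not shown in this notebook. The function `expand_one_level` and the publication names `A` to `F` are made up for the example.

```python
def expand_one_level(current, references):
    """Return publications linked to `current` by a reference, excluding `current`.

    `references` maps a publication to the set of publications it cites.
    This is a toy rule chosen for illustration, not the library's logic.
    """
    linked = set()
    for pub, cited in references.items():
        if pub in current:
            linked |= cited      # publications cited by the current set
        elif cited & current:
            linked.add(pub)      # publications that cite the current set
    return linked - current


# Made-up reference graph: A cites B, C cites A, D cites C, E cites F.
references = {"A": {"B"}, "C": {"A"}, "D": {"C"}, "E": {"F"}}

seed = {"A"}
l1 = expand_one_level(seed, references)
l2 = expand_one_level(seed | l1, references)

# Everything never reached by the expansion is a candidate negative example.
universe = set(references) | set().union(*references.values())
anti_seed_pool = universe - (seed | l1 | l2)
```

In this toy graph the first level reaches `B` and `C`, the second level reaches `D`, and `E` and `F` remain as the pool from which negative examples could be drawn.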

In [3]:
import fiz_lernmodule.expansion

expander = fiz_lernmodule.expansion.PatentLandscapeExpander(
    seed_file,
    seed_name,
    bq_project=bq_project,
    patent_dataset=patent_dataset,
    num_antiseed=num_anti_seed_patents,
    us_only=True,
    prepare_training=False)


This runs the actual expansion. The cell after it displays the head of the final training dataframe.

In [4]:
%%time

training_data_full_df, seed_patents_df, l1_patents_df, l2_patents_df, anti_seed_patents = \
    expander.load_from_disk_or_do_expansion()

Loading landscape data from BigQuery.
Loaded 3152 seed publication numbers


Loaded 2854 seed patents from BigQuery
Loading training data text from (2854, 2) publication numbers
Loading dataframe with cols Index(['publication_number'], dtype='object'), shape (2854, 1), to patents._tmp_training


1it [00:04,  4.52s/it]


Completed loading temp table.
Loading patent texts from provided publication numbers.
(2595, 12)
Merging labels into training data.
Saving landscape data to data\contact_lens_us_c\landscape_data.pkl.
Wall time: 58.2 s
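The method name `load_from_disk_or_do_expansion` and the "Saving landscape data to ... landscape_data.pkl" log line suggest a load-or-compute caching pattern. The sketch below shows that general pattern with `pickle`; it is not the library's implementation, and `load_or_compute` is a hypothetical helper name.

```python
import os
import pickle


def load_or_compute(cache_path, compute):
    """Return the cached result if the file exists; otherwise compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    result = compute()

    # Create the cache directory on first use, then store the result.
    cache_dir = os.path.dirname(cache_path)
    if cache_dir:
        os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

With a cache like this, a second call skips the expensive step entirely, and deleting the pickle file forces a fresh computation.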


In [5]:
training_data_full_df.head()

| | pub_num | publication_number | country_code | family_id | priority_date | title_text | abstract_text | claims_text | refs | cpcs | ipcs | assignees_harmonized | ExpansionLevel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6536898 | US-6536898-B1 | US | 24663616 | 20000915 | Extended depth of field optics for human vision | The present invention provides extended depth ... | What is claimed is: \n \n 1. Appara... | US-5748371-A,US-5476515-A, | A61F2002/1699,A61F2250/0036,A61F2/1624,G02C7/0... | A61F2/16,A61F2/14,G02B3/00,G02C7/04 | UNIV COLORADO | Seed |
| 1 | 2014036225 | US-2014036225-A1 | US | 48875618 | 20120731 | Lens incorporating myopia control optics and m... | Ophthalmic devices, such as contact lenses, ma... | What is claimed is: \n \n 1 . An o... | US-2008194481-A1,US-7637612-B2,US-2010239637-A... | A61K31/5513,G02C7/04,G02C7/041,A61K31/46,G02C2... | G02C7/04 | SHEDDEN JR ARTHUR H,CHENG XU,CHEHAB KHALED | Seed |
| 2 | 2017164704 | US-2017164704-A1 | US | 34915775 | 20020817 | Packaging for Disposable Soft Contact Lenses | The present disclosure provides a contact lens... | What is claimed is: \n \n 1 . A si... | US-4782942-A,US-2002175177-A1,US-3610516-A,GB-... | B65D75/30,B65D83/005,B65D75/32,A45C11/005,B65D... | B65D81/22,B65D75/52,B65D85/38,B65D75/28,B65D75... | MENICON SINGAPORE PTE LTD | Seed |
| 3 | 5401431 | US-5401431-A | US | 17709917 | 19921001 | Cleaning-preserving aqueous solution for conta... | A cleaning-preserving aqueous solution for con... | We claim: \n \n 1. A cleaning-preser... | JP-H0368503-A,JP-H02115116-A,JP-H04342508-A,JP... | C11D3/0078,C11D1/74,A61L12/04 | G02C13/00,C11D3/00,C11D1/722,C11D1/74,A61L2/04... | TOMEI SANGYO KK | Seed |
| 4 | 2011019148 | US-2011019148-A1 | US | 43497041 | 20090727 | Multifocal diffractive contact lens with bi-si... | A contact lens for placing over the eye is des... | 1 . An optic comprising a contact lens having ... | US-4340283-A,US-5054905-A,US-5114483-A,US-4655... | G02C7/042,G02C2202/20,G02C7/041 | G02C7/04 | PORTNEY VALDEMAR | Seed |


### Show some stats about the landscape training data

In [6]:
print('Seed/Positive examples:')
print(training_data_full_df[training_data_full_df.ExpansionLevel == 'Seed'].count())

print('\n\nAnti-Seed/Negative examples:')
print(training_data_full_df[training_data_full_df.ExpansionLevel == 'AntiSeed'].count())

Seed/Positive examples:
pub_num                 2595
publication_number      2595
country_code            2595
family_id               2595
priority_date           2595
title_text              2595
abstract_text           2595
claims_text             2595
refs                    2595
cpcs                    2595
ipcs                    2595
assignees_harmonized    2595
ExpansionLevel          2595
dtype: int64


Anti-Seed/Negative examples:
pub_num                 0
publication_number      0
country_code            0
family_id               0
priority_date           0
title_text              0
abstract_text           0
claims_text             0
refs                    0
cpcs                    0
ipcs                    0
assignees_harmonized    0
ExpansionLevel          0
dtype: int64
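Because `count()` repeats the same number for every column, a per-label tally can be easier to read. The sketch below uses `value_counts` on a small made-up dataframe that stands in for `training_data_full_df`; the rows are hypothetical and only the `ExpansionLevel` column name is taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for training_data_full_df, with only the label column.
df = pd.DataFrame({"ExpansionLevel": ["Seed", "Seed", "AntiSeed", "Seed"]})

# One row count per label instead of one count per column.
level_counts = df["ExpansionLevel"].value_counts()

# .get returns the default when a label is absent from the data.
n_seed = int(level_counts.get("Seed", 0))
n_anti_seed = int(level_counts.get("AntiSeed", 0))
```

An anti-seed count of 0, as in the output above, would leave a classifier with no negative examples, so it is worth checking this number before training.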
