pandas.io.json.json_normalize with very nested json

minecode 2023. 3. 26. 12:48

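I have been trying to normalize a very nested json file that I will later analyze. What I am struggling with is how to go more than one level deep when normalizing.

I went through the pandas.io.json.json_normalize documentation, since it does exactly what I want.

I have been able to normalize part of the file and now understand how dictionaries work, but I am still not there.

With the code below I can only get the first level.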

import json
import pandas as pd
from pandas.io.json import json_normalize

with open('authors_sample.json') as f:
    d = json.load(f)

raw = json_normalize(d['hits']['hits'])

authors = json_normalize(data = d['hits']['hits'], 
                         record_path = '_source', 
                         meta = ['_id', ['_source', 'journal'], ['_source', 'title'], 
                                 ['_source', 'normalized_venue_name']
                                 ])

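With the code below I am trying to "dig" into the 'authors' dictionary, but record_path = ['_source', 'authors'] throws TypeError: string indices must be integers. As far as I understand json_normalize, the logic should be fine, but I still do not quite understand how to dive into a json that mixes dict and list.

I also went through this simple example.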

authors = json_normalize(data = d['hits']['hits'], 
                         record_path = ['_source', 'authors'], 
                         meta = ['_id', ['_source', 'journal'], ['_source', 'title'], 
                                 ['_source', 'normalized_venue_name']
                                 ])

Here is a chunk of the json file (5 records).

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'7CB3F2AD',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': None,
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'Physical Review Letters',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'phys rev lett',
     u'pages': None,
     u'parent_keywords': [u'Chromatography',
      u'Quantum mechanics',
      u'Particle physics',
      u'Quantum field theory',
      u'Analytical chemistry',
      u'Quantum chromodynamics',
      u'Physics',
      u'Mass spectrometry',
      u'Chemistry'],
     u'pub_date': u'1987-03-02 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'mass spectra', u'elementary particles', u'bound states'],
     u'title': u'Evidence for a new meson: A quasinuclear NN-bar bound state',
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'Physical Review Letters',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'},
   {u'_id': u'7AF8EBC3',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': [{u'affiliations': [u'Punjabi University'],
       u'author_id': u'780E3459',
       u'author_name': u'munish puri'},
      {u'affiliations': [u'Punjabi University'],
       u'author_id': u'48D92C79',
       u'author_name': u'rajesh dhaliwal'},
      {u'affiliations': [u'Punjabi University'],
       u'author_id': u'7D9BD37C',
       u'author_name': u'r s singh'}],
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'Journal of Industrial Microbiology & Biotechnology',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'j ind microbiol biotechnol',
     u'pages': None,
     u'parent_keywords': [u'Nuclear medicine',
      u'Psychology',
      u'Hydrology',
      u'Chromatography',
      u'X-ray crystallography',
      u'Nuclear fusion',
      u'Medicine',
      u'Fluid dynamics',
      u'Thermodynamics',
      u'Physics',
      u'Gas chromatography',
      u'Radiobiology',
      u'Engineering',
      u'Organic chemistry',
      u'High-performance liquid chromatography',
      u'Chemistry',
      u'Organic synthesis',
      u'Psychotherapist'],
     u'pub_date': u'2008-04-04 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'flow rate',
      u'operant conditioning',
      u'packed bed reactor',
      u'immobilized enzyme',
      u'specific activity'],
     u'title': u'Development of a stable continuous flow immobilized enzyme reactor for the hydrolysis of inulin',
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'Journal of Industrial Microbiology & Biotechnology',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'},
   {u'_id': u'7521A721',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': [{u'author_id': u'7FF872BC',
       u'author_name': u'barbara eileen ryan'}],
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'The American Historical Review',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'american historical review',
     u'pages': None,
     u'parent_keywords': [u'Social science',
      u'Politics',
      u'Sociology',
      u'Law'],
     u'pub_date': u'1992-01-01 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'social movements'],
     u'title': u"Feminism and the women's movement : dynamics of change in social movement ideology, and activism",
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'The American Historical Review',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'},
   {u'_id': u'7DAEB9A4',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': [{u'author_id': u'0299B8E9',
       u'author_name': u'fraser j harbutt'}],
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'The American Historical Review',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'american historical review',
     u'pages': None,
     u'parent_keywords': [u'Superconductivity',
      u'Nuclear fusion',
      u'Geology',
      u'Chemistry',
      u'Metallurgy'],
     u'pub_date': u'1988-01-01 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'iron'],
     u'title': u'The iron curtain : Churchill, America, and the origins of the Cold War',
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'The American Historical Review',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'},
   {u'_id': u'7B3236C5',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': [{u'author_id': u'7DAB7B72',
       u'author_name': u'richard m freeland'}],
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'The American Historical Review',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'american historical review',
     u'pages': None,
     u'parent_keywords': [u'Political Science', u'Economics'],
     u'pub_date': u'1985-01-01 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'foreign policy'],
     u'title': u'The Truman Doctrine and the origins of McCarthyism : foreign policy, domestic politics, and internal security, 1946-1948',
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'The American Historical Review',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'}],
  u'max_score': 1.0,
  u'total': 36429433},
 u'timed_out': False,
 u'took': 170}
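
One way to see where the dicts and the lists sit in a structure like the one above is to print an outline of its container types. The following is a minimal sketch: `outline` is a hypothetical helper written for this illustration (it is not part of pandas), and it assumes each list is homogeneous, so it only inspects the first element.

```python
def outline(obj, depth=0, max_depth=6):
    """Return indented lines naming the type found under each key."""
    lines = []
    if depth > max_depth:
        return lines
    pad = "  " * depth
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            lines.extend(outline(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Assumption: the list is homogeneous, so the first element is enough.
        lines.append(f"{pad}[0]: {type(obj[0]).__name__}")
        lines.extend(outline(obj[0], depth + 1, max_depth))
    return lines


# A trimmed-down record with the same nesting as the sample file.
sample = {
    "hits": {
        "hits": [
            {
                "_id": "7AF8EBC3",
                "_source": {
                    "authors": [
                        {"author_id": "780E3459", "author_name": "munish puri"}
                    ],
                    "journal": "Journal of Industrial Microbiology & Biotechnology",
                },
            }
        ]
    }
}

print("\n".join(outline(sample)))
```

Every `list` line in the outline is a candidate for record_path, and every key that sits above it in a `dict` is a candidate for meta.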

In the pandas example (below), what do the brackets [...] mean? Is there a logic to follow to go deeper with them?
result = json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])

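Each string or list of strings in the ['state', 'shortname', ['info', 'governor']] value is a path to an element to include alongside the selected rows. The second json_normalize() argument (record_path, set to 'counties' in the documentation example) tells the function how to select the elements of the input data structure that make up the rows of the output, and the meta paths add further metadata that is included with each of those rows. Think of these as table joins in a database, if you will.

The input for the US States documentation example has two dictionaries in a list, and both of these dictionaries have a counties key that references another list of dicts: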

>>> data = [{'state': 'Florida',
...          'shortname': 'FL',
...         'info': {'governor': 'Rick Scott'},
...         'counties': [{'name': 'Dade', 'population': 12345},
...                      {'name': 'Broward', 'population': 40000},
...                      {'name': 'Palm Beach', 'population': 60000}]},
...         {'state': 'Ohio',
...          'shortname': 'OH',
...          'info': {'governor': 'John Kasich'},
...          'counties': [{'name': 'Summit', 'population': 1234},
...                       {'name': 'Cuyahoga', 'population': 1337}]}]
>>> pprint(data[0]['counties'])
[{'name': 'Dade', 'population': 12345},
 {'name': 'Broward', 'population': 40000},
 {'name': 'Palm Beach', 'population': 60000}]
>>> pprint(data[1]['counties'])
[{'name': 'Summit', 'population': 1234},
 {'name': 'Cuyahoga', 'population': 1337}]

Between them there are 5 rows of data to use in the output:

>>> json_normalize(data, 'counties')
         name  population
0        Dade       12345
1     Broward       40000
2  Palm Beach       60000
3      Summit        1234
4    Cuyahoga        1337

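The meta argument then names some elements that live next to those counties lists, and those are merged in separately. The values from the first data[0] dictionary for those meta elements are ('Florida', 'FL', 'Rick Scott'), and for data[1] the values are ('Ohio', 'OH', 'John Kasich'), so you see those values attached to the counties rows that came from the same top-level dictionary, repeated 3 and 2 times respectively: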

>>> data[0]['state'], data[0]['shortname'], data[0]['info']['governor']
('Florida', 'FL', 'Rick Scott')
>>> data[1]['state'], data[1]['shortname'], data[1]['info']['governor']
('Ohio', 'OH', 'John Kasich')
>>> json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

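So, if you pass in a list for the meta argument, each element in the list is a separate path, and each of those separate paths identifies data to add to the rows in the output.

Your example JSON has only a few nested lists to elevate with the first argument, as 'counties' did in the example. The only one in that data structure is the nested 'authors' key; you would extract each ['_source', 'authors'] path, after which you can add other keys from the parent object to augment those rows.

The meta argument then pulls in the _id key from the outermost objects, followed by the nested ['_source', 'title'] and ['_source', 'journal'] paths.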
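
The "table join" idea described above can be sketched in plain Python. This is only an illustration of the concept, not pandas' actual implementation: `pull` and `normalize_sketch` are hypothetical names, nested dicts inside the records are not flattened, and a missing (None) record list is simply treated as producing no rows.

```python
def pull(obj, path):
    """Follow a path (one key or a list of keys) down through nested dicts."""
    keys = [path] if isinstance(path, str) else path
    for key in keys:
        obj = obj[key]
    return obj


def normalize_sketch(records, record_path, meta, sep="."):
    """Rough emulation of json_normalize(records, record_path, meta)."""
    rows = []
    for parent in records:
        # record_path selects the list whose items become output rows.
        children = pull(parent, record_path) or []  # None means no rows
        for child in children:
            row = dict(child)
            # Each meta path is looked up on the parent and repeated per row.
            for path in meta:
                name = path if isinstance(path, str) else sep.join(path)
                row[name] = pull(parent, path)
            rows.append(row)
    return rows


hits = [
    {"_id": "7CB3F2AD",
     "_source": {"authors": None, "journal": "Physical Review Letters"}},
    {"_id": "7521A721",
     "_source": {"authors": [{"author_id": "7FF872BC",
                              "author_name": "barbara eileen ryan"}],
                 "journal": "The American Historical Review"}},
]

rows = normalize_sketch(hits, ["_source", "authors"],
                        ["_id", ["_source", "journal"]])
print(rows)
```

A plain string such as '_id' is a one-step path, while a list such as ['_source', 'journal'] is walked key by key, which is why the resulting column is named _source.journal.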

The record_path argument takes the authors lists as the starting point; these look like:

>>> d['hits']['hits'][0]['_source']['authors']   # this value is None, and is skipped
>>> d['hits']['hits'][1]['_source']['authors']
[{'affiliations': ['Punjabi University'],
  'author_id': '780E3459',
  'author_name': 'munish puri'},
 {'affiliations': ['Punjabi University'],
  'author_id': '48D92C79',
  'author_name': 'rajesh dhaliwal'},
 {'affiliations': ['Punjabi University'],
  'author_id': '7D9BD37C',
  'author_name': 'r s singh'}]
>>> d['hits']['hits'][2]['_source']['authors']
[{'author_id': '7FF872BC',
  'author_name': 'barbara eileen ryan'}]
>>> # etc.

which gives you these rows:

>>> json_normalize(d['hits']['hits'], ['_source', 'authors'])
           affiliations author_id          author_name
0  [Punjabi University]  780E3459          munish puri
1  [Punjabi University]  48D92C79      rajesh dhaliwal
2  [Punjabi University]  7D9BD37C            r s singh
3                   NaN  7FF872BC  barbara eileen ryan
4                   NaN  0299B8E9     fraser j harbutt
5                   NaN  7DAB7B72   richard m freeland

We can then use the third argument, meta, to add more columns such as _id, _source.title and _source.journal, using ['_id', ['_source', 'journal'], ['_source', 'title']]:

>>> json_normalize(
...     d['hits']['hits'],
...     ['_source', 'authors'],
...     ['_id', ['_source', 'journal'], ['_source', 'title']]
... )
           affiliations author_id          author_name       _id   \
0  [Punjabi University]  780E3459          munish puri  7AF8EBC3  
1  [Punjabi University]  48D92C79      rajesh dhaliwal  7AF8EBC3
2  [Punjabi University]  7D9BD37C            r s singh  7AF8EBC3
3                   NaN  7FF872BC  barbara eileen ryan  7521A721
4                   NaN  0299B8E9     fraser j harbutt  7DAEB9A4
5                   NaN  7DAB7B72   richard m freeland  7B3236C5

                                     _source.journal  \
0  Journal of Industrial Microbiology & Biotechno...
1  Journal of Industrial Microbiology & Biotechno...
2  Journal of Industrial Microbiology & Biotechno...
3                     The American Historical Review
4                     The American Historical Review
5                     The American Historical Review

                                       _source.title
0  Development of a stable continuous flow immobi...
1  Development of a stable continuous flow immobi...
2  Development of a stable continuous flow immobi...
3  Feminism and the women's movement : dynamics o...
4  The iron curtain : Churchill, America, and the...
5  The Truman Doctrine and the origins of McCarth...
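
For readers on pandas 1.0 or later, the same call is available as the top-level pd.json_normalize. Below is a self-contained sketch under a few assumptions: the `hits` list is a trimmed copy of three records from the sample, and the `cleaned` step (replacing a None authors value with an empty list) is a defensive choice made for this example so that it does not depend on how a particular pandas version treats None under record_path.

```python
import pandas as pd

hits = [
    {"_id": "7CB3F2AD",
     "_source": {"authors": None,
                 "journal": "Physical Review Letters",
                 "title": "Evidence for a new meson: A quasinuclear NN-bar bound state"}},
    {"_id": "7AF8EBC3",
     "_source": {"authors": [
                     {"affiliations": ["Punjabi University"],
                      "author_id": "780E3459",
                      "author_name": "munish puri"},
                     {"affiliations": ["Punjabi University"],
                      "author_id": "48D92C79",
                      "author_name": "rajesh dhaliwal"}],
                 "journal": "Journal of Industrial Microbiology & Biotechnology",
                 "title": "Development of a stable continuous flow immobilized "
                          "enzyme reactor for the hydrolysis of inulin"}},
    {"_id": "7521A721",
     "_source": {"authors": [{"author_id": "7FF872BC",
                              "author_name": "barbara eileen ryan"}],
                 "journal": "The American Historical Review",
                 "title": "Feminism and the women's movement : dynamics of change "
                          "in social movement ideology, and activism"}},
]

# Defensive step (an assumption of this sketch): swap None for an empty list.
cleaned = [
    {**hit, "_source": {**hit["_source"],
                        "authors": hit["_source"].get("authors") or []}}
    for hit in hits
]

authors = pd.json_normalize(
    cleaned,
    record_path=["_source", "authors"],   # one output row per author
    meta=["_id", ["_source", "journal"], ["_source", "title"]],
)
print(authors)
```

The record without authors contributes no rows, and the meta values of each remaining record are repeated once per author.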

You can also have a look at the flatten_json library, which does not require you to write out column hierarchies as json_normalize does:

import pandas as pd
from flatten_json import flatten

data = d['hits']['hits']
dict_flattened = (flatten(record, '.') for record in data)
df = pd.DataFrame(dict_flattened)
print(df)

See https://github.com/amirziai/flatten.

Adding to Sanders' comment: the author of this function has a Medium blog post with more context, see https://towardsdatascience.com/flattening-json-objects-in-python-f5343c794b10

pandas' json_normalize can handle most json objects, such as arrays. The flatten_json library requires the input to be a nested dict. However, you can work around this requirement by adding the array to a dict, like so:

flatten({'response':data}, '.')

In this case, the flatten_json library uses a counter in the dot notation to tell the otherwise duplicate keys apart. For example:

flatten({
    'response': [
        {'metrics': {'clicks': '0', 'cost_micros': '0', 'impressions': '3'},
         'segments': {'date': '2022-12-01'}},
        {'metrics': {'clicks': '1', 'cost_micros': '609240', 'impressions': '358'},
         'segments': {'date': '2022-12-01'}},
        {'metrics': {'clicks': '0', 'cost_micros': '0', 'impressions': '3'},
         'segments': {'date': '2022-12-02'}},
        {'metrics': {'clicks': '2', 'cost_micros': '40000', 'impressions': '291'},
         'segments': {'date': '2022-12-02'}},
        {'metrics': {'clicks': '0', 'cost_micros': '0', 'impressions': '2'},
         'segments': {'date': '2022-12-03'}},
        {'metrics': {'clicks': '2', 'cost_micros': '337754', 'impressions': '241'},
         'segments': {'date': '2022-12-03'}},
        {'metrics': {'clicks': '0', 'cost_micros': '0', 'impressions': '4'},
         'segments': {'date': '2022-12-04'}},
        {'metrics': {'clicks': '2', 'cost_micros': '757299', 'impressions': '197'},
         'segments': {'date': '2022-12-04'}}
    ]
}, '.')

Produces:

{'response.0.metrics.clicks': '0',
 'response.0.metrics.cost_micros': '0',
 'response.0.metrics.impressions': '3',
 'response.0.segments.date': '2022-12-01',
 'response.1.metrics.clicks': '1',
 'response.1.metrics.cost_micros': '609240',
 'response.1.metrics.impressions': '358',
 'response.1.segments.date': '2022-12-01',
 'response.2.metrics.clicks': '0',
 'response.2.metrics.cost_micros': '0',
 'response.2.metrics.impressions': '3',
 'response.2.segments.date': '2022-12-02',
 'response.3.metrics.clicks': '2',
 'response.3.metrics.cost_micros': '40000',
 'response.3.metrics.impressions': '291',
 'response.3.segments.date': '2022-12-02',
 'response.4.metrics.clicks': '0',
 'response.4.metrics.cost_micros': '0',
 'response.4.metrics.impressions': '2',
 'response.4.segments.date': '2022-12-03',
 'response.5.metrics.clicks': '2',
 'response.5.metrics.cost_micros': '337754',
 'response.5.metrics.impressions': '241',
 'response.5.segments.date': '2022-12-03',
 'response.6.metrics.clicks': '0',
 'response.6.metrics.cost_micros': '0',
 'response.6.metrics.impressions': '4',
 'response.6.segments.date': '2022-12-04',
 'response.7.metrics.clicks': '2',
 'response.7.metrics.cost_micros': '757299',
 'response.7.metrics.impressions': '197',
 'response.7.segments.date': '2022-12-04'}
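
The counter behaviour shown above can be reproduced with a short recursive function. This is a sketch of the general idea only, not the flatten_json source: `flatten_sketch` is a hypothetical name, and unlike a full library it silently drops empty dicts and lists.

```python
def flatten_sketch(obj, sep=".", prefix=""):
    """Flatten nested dicts and lists into a single dict with joined keys."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)      # list positions become part of the key
    else:
        return {prefix: obj}        # leaf value: emit it under the built key
    flat = {}
    for key, value in items:
        child = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten_sketch(value, sep, child))
    return flat


nested = {
    "response": [
        {"metrics": {"clicks": "0"}, "segments": {"date": "2022-12-01"}},
        {"metrics": {"clicks": "1"}, "segments": {"date": "2022-12-01"}},
    ]
}
flat = flatten_sketch(nested)
print(flat)
```

Because every list element gets its own numbered keys, wrapping a whole array in a dict yields one wide record; flattening each record separately, as in the generator expression earlier, yields one row per record instead.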

Source URL: https://stackoverflow.com/questions/47242845/pandas-io-json-json-normalize-with-very-nested-json
